{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3f0f8e8c-7372-4107-a92a-6fa90ce1713d",
   "metadata": {},
   "source": [
    "# Web Scraper & Summarizer\n",
    "\n",
    "A tiny demo that fetches text from a public webpage, breaks it into chunks, and uses an OpenAI model to produce a concise summary with bullet points.\n",
    "\n",
    "**Features**\n",
    "\n",
    "* Fetches static pages (`requests` + `BeautifulSoup`) and extracts headings, paragraphs, and list items.\n",
    "* Hierarchical summarization: chunk → chunk-summaries → final summary.\n",
    "* Configurable system prompt and character-based chunking (3,000 characters per chunk by default) to keep each request within model limits.\n",
    "\n",
    "**Quick run**\n",
    "\n",
    "1. Add `OPENAI_API_KEY=sk-...` to a `.env` file.\n",
    "2. `pip install requests beautifulsoup4 python-dotenv openai`\n",
    "3. Set `url` in the run cell to the page you want to summarize, then run the notebook.\n",
    "\n",
    "**Note**: Use this for public, static pages only; JavaScript-heavy sites need a browser-automation tool such as Playwright or Selenium.\n"
   ]
  },
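  {
   "cell_type": "markdown",
   "id": "sketch-hierarchical-flow",
   "metadata": {},
   "source": [
    "**Sketch: the chunk → chunk-summaries → final summary flow**\n",
    "\n",
    "The hierarchical flow listed under Features can be illustrated without any API call. The block below is a minimal sketch, not the notebook's implementation: `split_fixed`, `summarize_hierarchically`, and `first_three_words` are hypothetical names, and the stand-in summarizer simply keeps the first three words so the control flow is easy to follow.\n",
    "\n",
    "```python\n",
    "from typing import Callable, List\n",
    "\n",
    "\n",
    "def split_fixed(text: str, max_chars: int) -> List[str]:\n",
    "    # Naive fixed-width split (assumption: no boundary detection here).\n",
    "    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]\n",
    "\n",
    "\n",
    "def summarize_hierarchically(text: str,\n",
    "                             summarize: Callable[[str], str],\n",
    "                             max_chars: int = 40) -> str:\n",
    "    chunks = split_fixed(text, max_chars)\n",
    "    partials = [summarize(c) for c in chunks]  # chunk -> chunk summaries\n",
    "    if len(partials) == 1:\n",
    "        return partials[0]  # short input: nothing to combine\n",
    "    return summarize(' | '.join(partials))  # chunk summaries -> final summary\n",
    "\n",
    "\n",
    "def first_three_words(s: str) -> str:\n",
    "    # Hypothetical stand-in for a model call: keep the first three words.\n",
    "    return ' '.join(s.split()[:3])\n",
    "\n",
    "\n",
    "demo = 'alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu nu xi'\n",
    "result = summarize_hierarchically(demo, first_three_words, max_chars=30)\n",
    "print(result)\n",
    "```\n",
    "\n",
    "Swapping `first_three_words` for a function that calls a model gives the same two-level structure that `hierarchical_summarize` uses in this notebook.\n"
   ]
  },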
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ddd58a2c-b8d1-46ef-9b89-053c451f28cf",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
   ],
   "source": [
    "%pip install requests beautifulsoup4 python-dotenv openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "4d027b2c-6663-4234-b364-a252b2a43cef",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "API Key prefix: sk-proj-lL\n"
     ]
    }
   ],
   "source": [
    "from dotenv import load_dotenv\n",
    "import os\n",
    "import openai\n",
    "\n",
    "load_dotenv()  # loads variables from .env into the environment\n",
    "openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n",
    "\n",
    "if not openai.api_key:\n",
    "    raise ValueError(\"OPENAI_API_KEY not found. Please create a .env file with OPENAI_API_KEY=<your_key>\")\n",
    "else:\n",
    "    print(\"API Key prefix:\", openai.api_key[:10])  # show only prefix for safety"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "c4928820-abaa-4b44-b506-c053ebc447f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# fetch_text_from_url: extract text from headings, paragraphs, and list items of a static page\n",
    "import requests\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def fetch_text_from_url(url, max_items=300, timeout=15):\n",
    "    \"\"\"\n",
    "    Fetch the page using requests and extract text from common tags.\n",
    "    Returns a single string containing the joined text blocks.\n",
    "    \"\"\"\n",
    "    resp = requests.get(url, timeout=timeout)\n",
    "    resp.raise_for_status()\n",
    "    soup = BeautifulSoup(resp.text, \"html.parser\")\n",
    "\n",
    "    items = []\n",
    "    for tag in soup.find_all([\"h1\", \"h2\", \"h3\", \"p\", \"li\"], limit=max_items):\n",
    "        text = tag.get_text(\" \", strip=True)\n",
    "        if text:\n",
    "            items.append(text)\n",
    "    return \"\\n\\n\".join(items)"
   ]
  },
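  {
   "cell_type": "markdown",
   "id": "sketch-tag-extraction",
   "metadata": {},
   "source": [
    "**Sketch: what the tag filter keeps**\n",
    "\n",
    "To make the extraction step concrete, the block below runs the same `find_all` / `get_text` calls on an inline HTML string instead of a live page. It is a minimal sketch under stated assumptions: the `html` sample and the `blocks` variable are made up for illustration.\n",
    "\n",
    "```python\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# Hypothetical sample page; no network access is needed.\n",
    "html = (\n",
    "    '<html><body>'\n",
    "    '<h1>Title</h1>'\n",
    "    '<script>var x = 1;</script>'\n",
    "    '<p>First <b>bold</b> paragraph.</p>'\n",
    "    '<ul><li>Item one</li></ul>'\n",
    "    '<div>Loose div text</div>'\n",
    "    '</body></html>'\n",
    ")\n",
    "\n",
    "soup = BeautifulSoup(html, 'html.parser')\n",
    "\n",
    "# Keep only headings, paragraphs, and list items, in document order.\n",
    "blocks = [\n",
    "    tag.get_text(' ', strip=True)\n",
    "    for tag in soup.find_all(['h1', 'h2', 'h3', 'p', 'li'])\n",
    "]\n",
    "\n",
    "print(blocks)\n",
    "```\n",
    "\n",
    "Text inside `<script>` and bare `<div>` elements is skipped because those tags are not in the filter. One caveat: a `<p>` nested inside an `<li>` matches twice, so such text would appear in the result more than once.\n"
   ]
  },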
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "cbd5d304-51b5-4d15-b4ce-31897adc03a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# chunk_text: split long text into manageable pieces\n",
    "# summarize_chunk: call OpenAI model to summarize one chunk\n",
    "# hierarchical_summarize: summarize chunks then combine summaries into a final summary\n",
    "\n",
    "import time\n",
    "\n",
    "def chunk_text(text, max_chars=3000):\n",
    "    \"\"\"\n",
    "    Simple character-based chunking.\n",
    "    Cuts at a paragraph or sentence boundary when one exists; otherwise makes a hard cut.\n",
    "    \"\"\"\n",
    "    chunks = []\n",
    "    start = 0\n",
    "    text_len = len(text)\n",
    "    while start < text_len:\n",
    "        end = start + max_chars\n",
    "        if end < text_len:\n",
    "            # Prefer to cut at a blank line, then at a sentence end.\n",
    "            cut = text.rfind(\"\\n\\n\", start, end)\n",
    "            if cut <= start:\n",
    "                cut = text.rfind(\". \", start, end)\n",
    "                if cut > start:\n",
    "                    cut += 1  # keep the period with the current chunk\n",
    "            if cut <= start:\n",
    "                cut = end  # no usable boundary: hard cut so the loop always advances\n",
    "            end = cut\n",
    "        chunk = text[start:end].strip()\n",
    "        if chunk:\n",
    "            chunks.append(chunk)\n",
    "        start = end\n",
    "    return chunks\n",
    "\n",
    "def summarize_chunk(chunk, system_prompt=None, model=\"gpt-4o-mini\", temperature=0.2):\n",
    "    \"\"\"\n",
    "    Summarize a single chunk using the OpenAI chat completions API.\n",
    "    Returns the model's text output.\n",
    "    \"\"\"\n",
    "    if system_prompt is None:\n",
    "        system_prompt = \"You are a concise summarizer. Produce a short (~100 words) summary and 3 bullet points.\"\n",
    "\n",
    "    messages = [\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": f\"Summarize the following text concisely. Keep it short.\\n\\nTEXT:\\n{chunk}\"}\n",
    "    ]\n",
    "\n",
    "    resp = openai.chat.completions.create(\n",
    "        model=model,\n",
    "        messages=messages,\n",
    "        temperature=temperature,\n",
    "    )\n",
    "    return resp.choices[0].message.content\n",
    "\n",
    "def hierarchical_summarize(text, max_chunk_chars=3000, model=\"gpt-4o-mini\"):\n",
    "    \"\"\"\n",
    "    1) Split the text into chunks\n",
    "    2) Summarize each chunk\n",
    "    3) Combine chunk summaries and ask model for a final concise summary\n",
    "    \"\"\"\n",
    "    chunks = chunk_text(text, max_chars=max_chunk_chars)\n",
    "    print(f\"[info] {len(chunks)} chunk(s) created.\")\n",
    "    chunk_summaries = []\n",
    "    for i, c in enumerate(chunks, 1):\n",
    "        print(f\"[info] Summarizing chunk {i}/{len(chunks)} (chars={len(c)})...\")\n",
    "        s = summarize_chunk(c, model=model)\n",
    "        chunk_summaries.append(s)\n",
    "        time.sleep(0.5)  # small delay to avoid hitting rate limits\n",
    "\n",
    "    if len(chunk_summaries) == 1:\n",
    "        return chunk_summaries[0]\n",
    "\n",
    "    combined = \"\\n\\n---\\n\\n\".join(chunk_summaries)\n",
    "    final_prompt = \"You are a concise summarizer. Combine the following chunk summaries into one final summary of about 150 words and 5 bullet points.\"\n",
    "    final_messages = [\n",
    "        {\"role\": \"system\", \"content\": final_prompt},\n",
    "        {\"role\": \"user\", \"content\": combined}\n",
    "    ]\n",
    "    resp = openai.chat.completions.create(\n",
    "        model=model,\n",
    "        messages=final_messages,\n",
    "        temperature=0.2,\n",
    "    )\n",
    "    return resp.choices[0].message.content\n"
   ]
  },
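  {
   "cell_type": "markdown",
   "id": "sketch-retry-backoff",
   "metadata": {},
   "source": [
    "**Sketch: retrying with exponential backoff**\n",
    "\n",
    "The fixed `time.sleep(0.5)` above spaces requests out but does not recover from a failed call. One common option is to retry with a growing delay. The block below is a minimal sketch of that idea, not part of the notebook's pipeline: `call_with_backoff` and `flaky` are hypothetical names, and catching a bare `Exception` is a simplification chosen so the example needs no API-specific error types.\n",
    "\n",
    "```python\n",
    "import time\n",
    "from typing import Callable, List, TypeVar\n",
    "\n",
    "T = TypeVar('T')\n",
    "\n",
    "\n",
    "def call_with_backoff(fn: Callable[[], T],\n",
    "                      max_attempts: int = 4,\n",
    "                      base_delay: float = 0.5,\n",
    "                      sleep: Callable[[float], None] = time.sleep) -> T:\n",
    "    # Retry fn, doubling the wait after each failure: 0.5s, 1s, 2s, ...\n",
    "    for attempt in range(max_attempts):\n",
    "        try:\n",
    "            return fn()\n",
    "        except Exception:\n",
    "            if attempt == max_attempts - 1:\n",
    "                raise  # out of attempts: surface the last error\n",
    "            sleep(base_delay * 2 ** attempt)\n",
    "    raise ValueError('max_attempts must be at least 1')\n",
    "\n",
    "\n",
    "# Hypothetical flaky call standing in for a rate-limited request.\n",
    "attempts = {'count': 0}\n",
    "\n",
    "\n",
    "def flaky() -> str:\n",
    "    attempts['count'] += 1\n",
    "    if attempts['count'] < 3:\n",
    "        raise RuntimeError('simulated rate limit')\n",
    "    return 'ok'\n",
    "\n",
    "\n",
    "waits: List[float] = []\n",
    "result = call_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping\n",
    "print(result, waits)\n",
    "```\n",
    "\n",
    "In this notebook the wrapped call could be `lambda: summarize_chunk(c, model=model)`; passing `sleep` as a parameter keeps the helper easy to test without real delays.\n"
   ]
  },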
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "9a23facd-4abe-4981-bd94-b14f5a61c8fe",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[info] Fetching page: https://www.basketball-reference.com/\n",
      "[info] Fetched text length: 11778\n",
      "[info] Running hierarchical summarization...\n",
      "[info] 5 chunk(s) created.\n",
      "[info] Summarizing chunk 1/5 (chars=2430)...\n",
      "[info] Summarizing chunk 2/5 (chars=2460)...\n",
      "[info] Summarizing chunk 3/5 (chars=2426)...\n",
      "[info] Summarizing chunk 4/5 (chars=2467)...\n",
      "[info] Summarizing chunk 5/5 (chars=1987)...\n",
      "\n",
      "\n",
      "=== FINAL SUMMARY ===\n",
      "\n",
      "Sports Reference is a comprehensive platform for sports statistics and history, particularly focusing on basketball, baseball, football, hockey, and soccer. It offers tools like Stathead for advanced data analysis and the Immaculate Grid for interactive gameplay. Users can access player stats, team standings, and historical records without ads. \n",
      "\n",
      "- Extensive stats available for NBA, WNBA, G League, and international leagues.\n",
      "- Daily recaps of NBA and WNBA performances delivered via email.\n",
      "- Stathead Basketball provides in-depth stats with a free first month for new subscribers.\n",
      "- Upcoming events include the NBA All-Star Weekend (February 13-15, 2026) and the start of the NBA season (October 21, 2026).\n",
      "- Features include trivia games, a blog, and resources for sports writers, enhancing user engagement and knowledge.\n"
     ]
    }
   ],
   "source": [
    "# Change the URL to any static (non-JS-heavy) page you want to test.\n",
    "if __name__ == \"__main__\":\n",
    "    url = \"https://www.basketball-reference.com/\"  # replace with your chosen URL\n",
    "    print(\"[info] Fetching page:\", url)\n",
    "    page_text = fetch_text_from_url(url, max_items=300)\n",
    "    print(\"[info] Fetched text length:\", len(page_text))\n",
    "\n",
    "    print(\"[info] Running hierarchical summarization...\")\n",
    "    final_summary = hierarchical_summarize(page_text, max_chunk_chars=2500)\n",
    "    print(\"\\n\\n=== FINAL SUMMARY ===\\n\")\n",
    "    print(final_summary)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
